From d3f4f0710857bfb71f06267e8d0a1d8abfa691aa Mon Sep 17 00:00:00 2001
From: memadi <memadi@nvidia.com>
Date: Mon, 8 Jun 2026 11:49:27 -0700
Subject: [PATCH 1/6] add docs for Anonymizer-replace evaluation

Signed-off-by: memadi <memadi@nvidia.com>
---
 docs/concepts/evaluation.md | 170 ++++++++++++++++++++++++++++++++++++
 docs/concepts/replace.md    |   8 +-
 mkdocs.yml                  |   1 +
 3 files changed, 178 insertions(+), 1 deletion(-)
 create mode 100644 docs/concepts/evaluation.md

diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md
new file mode 100644
index 00000000..8b6312c5
--- /dev/null
+++ b/docs/concepts/evaluation.md
@@ -0,0 +1,170 @@
+<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -->
+<!-- SPDX-License-Identifier: Apache-2.0 -->
+
+# Evaluation
+
+Anonymizer provides LLM-as-judge evaluation for both replace and rewrite modes, but each mode runs it differently:
+
+| Mode | How evaluation runs |
+|------|---------------------|
+| **Replace** | Post-hoc, via a separate `Anonymizer.evaluate()` call after `run()` / `preview()`. |
+| **Rewrite** | Built into the anonymization pipeline — runs automatically as part of every `run()` / `preview()` call. |
+
+---
+
+## Rewrite evaluation
+
+Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review.
+
+The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for the more details.
+
+---
+
+## Replace evaluation
+
+Replace evaluation is **optional and post-hoc** — you call `Anonymizer.evaluate()` on a result from `run()` or `preview()`. The replace mode is read directly from the result object, so you don't restate it:
+
+```python
+from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute
+
+anonymizer = Anonymizer()
+cfg = AnonymizerConfig(replace=Substitute())
+src = AnonymizerInput(data="data.csv", text_column="text")
+
+result = anonymizer.run(config=cfg, data=src)
+evaluated = anonymizer.evaluate(result)
+evaluated.display_record(0)
+```
+
+You can also save a `preview()` result and evaluate it in a separate session:
+
+```python
+import pickle
+
+preview = anonymizer.preview(config=cfg, data=src, num_records=15)
+
+with open("/tmp/preview.pkl", "wb") as f:
+    pickle.dump(preview, f)
+
+# … later …
+with open("/tmp/preview.pkl", "rb") as f:
+    loaded = pickle.load(f)
+
+evaluated = anonymizer.evaluate(loaded)
+```
+
+All judges run per record. Records with no detected entities are skipped — judges return `valid=True` with an empty invalid list, meaning there was nothing to evaluate, not that quality was confirmed. The three replace judges additionally require a replacement map, so they only run on records processed by Substitute.
+
+---
+
+## Detection validity
+
+> "Are the detected entities actually correct (value, label) pairs in context?"
+
+This judge runs during replace evaluation regardless of which replace mode was used. It looks at each detected span and flags:
+
+- **false_positive** — the span is not actually identifying or sensitive in this context (common word, generic phrase, boilerplate).
+- **wrong_label** — the span is sensitive but the label sits in a clearly different domain (e.g. a company name labeled `first_name`). Sibling labels within the same broad domain are treated as valid.
+- **not_in_text** — the literal value does not appear in the original text.
+- **wrong_boundary** — the span is a clear partial or over-extended capture (omits part of the actual value, or absorbs surrounding function words). Descriptive words in natural prose around a bare entity value are not a boundary error.
+- **contextual_mismatch** — the span refers to something other than the labeled entity type in this context (e.g. "Apple" as fruit labeled `company_name`).
+
+| Output column | Type | Description |
+|---|---|---|
+| `detection_valid` | `bool \| None` | `True` if all detections pass; `None` if the judge was unavailable. |
+| `detection_invalid_entities` | `list` | Each flagged detection with value, label, and one-sentence reasoning. |
+
+---
+
+## Replace judges
+
+When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension.
+
+### Type fidelity
+
+> "Does each synthetic value still belong to the same entity class and match the expected format for that class?"
+
+The judge checks that replacements are shape-compatible with their originals (same granularity and character class), anchored by what the original itself looks like. It does **not** check semantic attributes (gender, age bucket) or cross-entity consistency; those are covered by the attribute fidelity and relational consistency judges below.
+
+| Output column | Type | Description |
+|---|---|---|
+| `type_fidelity_valid` | `bool \| None` | `True` if all replacements pass; `None` if the judge was unavailable. |
+| `type_fidelity_invalid_replacements` | `list` | Each failing replacement with label, original, synthetic, and reasoning. |
+
+### Attribute fidelity
+
+> "Does each synthetic value preserve the salient within-entity attributes of the original?"
+
+The judge checks attributes including:
+
+- **Gender of name** — applies to `first_name`, `last_name`, `user_name`. Only checked when the original name clearly implies a gender. Adjacent or ambiguous cases pass.
+- **Age bucket** — applies to `age` and `date_of_birth`. Buckets: child (0–12), teen (13–19), young adult (20–29), adult (30–44), middle-aged (45–64), senior (65+). Adjacent buckets pass; only clear flips (adult → child) fail.
+
+All other labels are skipped — their attributes are either handled by other metrics or too unreliable to judge automatically.
+
+| Output column | Type | Description |
+|---|---|---|
+| `attribute_fidelity_valid` | `bool \| None` | `True` if all checked attributes pass; `None` if unavailable. |
+| `attribute_fidelity_invalid_entities` | `list` | Each failing entity with attributes checked and reasoning. |
+
+### Relational consistency
+
+> "Do the synthetic entities preserve the same relational coherence with each other that the originals had?"
+
+The judge inspects cross-entity relationships within a record — for example, whether a synthetic city is actually located in the synthetic state, or whether a synthetic date of birth is consistent with a synthetic age. Records with no checkable relationships always pass.
+
+Relationships inspected include geographic pairings (city ↔ state, city ↔ postcode), temporal coherence (date of birth ↔ age), and name–email alignment.
+
+| Output column | Type | Description |
+|---|---|---|
+| `relational_consistency_valid` | `bool \| None` | `True` if all relations pass; `None` if unavailable. |
+| `relational_consistency_invalid_relations` | `list` | Each failing relation with participants and reasoning. |
+
+---
+
+## Reading replace evaluation results
+
+`display_record()` renders a formatted per-record view that includes all four judge verdicts alongside the replacement map:
+
+```python
+evaluated.display_record(0)
+```
+
+For a tabular overview across all records:
+
+```python
+evaluated.dataframe[
+    [
+        "detection_valid",
+        "type_fidelity_valid",
+        "attribute_fidelity_valid",
+        "relational_consistency_valid",
+    ]
+]
+```
+
+Use `trace_dataframe` for the full internal trace including raw judge outputs.
+
+---
+
+## Model roles
+
+All four judges default to `gpt-oss-120b`. Defaults are defined in [`evaluate.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/evaluate.yaml). Override them by passing a `model_configs` YAML to `Anonymizer(model_configs=...)` — see [Models](models.md) for the full override pattern.
+
+| Role | Default | Purpose |
+|------|---------|---------|
+| `detection_validity_judge` | `gpt-oss-120b` | Checks detected (value, label) pairs for correctness. |
+| `replace_type_fidelity_judge` | `gpt-oss-120b` | Checks entity class and format preservation. |
+| `replace_attribute_fidelity_judge` | `gpt-oss-120b` | Checks within-entity attribute preservation. |
+| `replace_relational_consistency_judge` | `gpt-oss-120b` | Checks cross-entity coherence within a record. |
+
+```yaml
+# my_models.yaml
+selected_models:
+  evaluate:
+    detection_validity_judge: your-model-alias
+    replace_type_fidelity_judge: your-model-alias
+    replace_attribute_fidelity_judge: your-model-alias
+    replace_relational_consistency_judge: your-model-alias
+```
+
diff --git a/docs/concepts/replace.md b/docs/concepts/replace.md
index 69c30d8f..1e5f87d9 100644
--- a/docs/concepts/replace.md
+++ b/docs/concepts/replace.md
@@ -116,4 +116,10 @@ AnonymizerConfig(replace=Hash(algorithm="sha1", digest_length=8))
 |-------|---------|-------------|
 | `algorithm` | `sha256` | Hash algorithm (`sha256`, `sha1`, or `md5`). |
 | `digest_length` | `12` | Number of hex characters to keep (6--64). |
-| `format_template` | `<HASH_{label}_{digest}>` | Template with `{digest}` required; `{label}` optional. |
\ No newline at end of file
+| `format_template` | `<HASH_{label}_{digest}>` | Template with `{digest}` required; `{label}` optional. |
+
+---
+
+## Evaluating replace output
+
+After running `replace`, you can score the quality of substitutions using LLM-as-judge evaluation. See [Evaluation](evaluation.md) for details on all four judges (detection validity, type fidelity, attribute fidelity, relational consistency) and how to call `Anonymizer.evaluate()`.
\ No newline at end of file
diff --git a/mkdocs.yml b/mkdocs.yml
index df44ad60..8b4f54ad 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -159,6 +159,7 @@ nav:
       - Replace: concepts/replace.md
       - Rewrite: concepts/rewrite.md
       - Choosing a Strategy: concepts/choosing-a-strategy.md
+      - Evaluation: concepts/evaluation.md
       - Self-hosting GLiNER: concepts/self-hosting-gliner.md
   - Troubleshooting: troubleshooting.md
   - Tutorials:

From 50262c8899d024edd9deef43340731f20b528d1b Mon Sep 17 00:00:00 2001
From: memadi <memadi@nvidia.com>
Date: Mon, 8 Jun 2026 11:55:01 -0700
Subject: [PATCH 2/6] nit: summarize the four judges and regroup judge headings

Signed-off-by: memadi <memadi@nvidia.com>
---
 docs/concepts/evaluation.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md
index 8b6312c5..19371e61 100644
--- a/docs/concepts/evaluation.md
+++ b/docs/concepts/evaluation.md
@@ -53,11 +53,12 @@ with open("/tmp/preview.pkl", "rb") as f:
 evaluated = anonymizer.evaluate(loaded)
 ```
 
-All judges run per record. Records with no detected entities are skipped — judges return `valid=True` with an empty invalid list, meaning there was nothing to evaluate, not that quality was confirmed. The three replace judges additionally require a replacement map, so they only run on records processed by Substitute.
+Four LLM judges run: one that scores detection quality and three that score replacement quality (Substitute mode only). Note that all 4 scores are assigned per record.
 
 ---
+### Entity Detection judge:
 
-## Detection validity
+### Detection validity
 
 > "Are the detected entities actually correct (value, label) pairs in context?"
 
@@ -76,7 +77,7 @@ This judge runs during replace evaluation regardless of which replace mode was u
 
 ---
 
-## Replace judges
+### Entity Replacment judges
 
 When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension.
 

From 5a4eca9b5ae3ad99d3aee8359dbf4bb8f51f62b3 Mon Sep 17 00:00:00 2001
From: Marjan Emadi <memadi@nvidia.com>
Date: Mon, 8 Jun 2026 12:11:07 -0700
Subject: [PATCH 3/6] Update docs/concepts/evaluation.md: fix "Replacment" typo

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
---
 docs/concepts/evaluation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md
index 19371e61..08c45556 100644
--- a/docs/concepts/evaluation.md
+++ b/docs/concepts/evaluation.md
@@ -77,7 +77,7 @@ This judge runs during replace evaluation regardless of which replace mode was u
 
 ---
 
-### Entity Replacment judges
+### Entity Replacement judges
 
 When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension.
 

From 7ea4c4a3777c493a260b89d3d844cc25afa054a8 Mon Sep 17 00:00:00 2001
From: Marjan Emadi <memadi@nvidia.com>
Date: Mon, 8 Jun 2026 12:12:16 -0700
Subject: [PATCH 4/6] Update docs/concepts/evaluation.md: fix "for the more details"

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
---
 docs/concepts/evaluation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md
index 08c45556..3585cdb1 100644
--- a/docs/concepts/evaluation.md
+++ b/docs/concepts/evaluation.md
@@ -16,7 +16,7 @@ Anonymizer provides LLM-as-judge evaluation for both modes, replace and rewrite,
 
 Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review.
 
-The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for the more details.
+The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details.
 
 ---
 

From dfed68551433337521efd01a66bd945f76f5d4c9 Mon Sep 17 00:00:00 2001
From: memadi <memadi@nvidia.com>
Date: Mon, 8 Jun 2026 13:13:46 -0700
Subject: [PATCH 5/6] address feedback: reorder sections, normalize headings, drop roles table

Signed-off-by: memadi <memadi@nvidia.com>
---
 docs/concepts/evaluation.md | 51 +++++++++++++++++--------------------
 docs/concepts/replace.md    |  2 +-
 2 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md
index 3585cdb1..64ab8752 100644
--- a/docs/concepts/evaluation.md
+++ b/docs/concepts/evaluation.md
@@ -8,21 +8,13 @@ Anonymizer provides LLM-as-judge evaluation for both modes, replace and rewrite,
 | Mode | How evaluation runs |
 |------|---------------------|
 | **Replace** | Post-hoc, via a separate `Anonymizer.evaluate()` call after `run()` / `preview()`. |
-| **Rewrite** | Built into the anonymization pipeline — runs automatically as part of every `run()` / `preview()` call. |
+| **Rewrite** | Runs automatically as part of every `run()` / `preview()` call. A dedicated post-hoc `evaluate()` call, matching replace mode, is planned for a future release. |
 
 ---
 
-## Rewrite evaluation
+## Replace Evaluation
 
-Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review.
-
-The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details.
-
----
-
-## Replace evaluation
-
-Replace evaluation is **optional and post-hoc** — you call `Anonymizer.evaluate()` on a result from `run()` or `preview()`. The replace mode is read directly from the result object, so you don't restate it:
+Replace evaluation is **optional and post-hoc** — you call `Anonymizer.evaluate()` on a result from `run()` or `preview()`:
 
 ```python
 from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute
@@ -36,7 +28,7 @@ evaluated = anonymizer.evaluate(result)
 evaluated.display_record(0)
 ```
 
-You can also save a `preview()` result and evaluate it in a separate session:
+Both `run()` and `preview()` results can be saved and evaluated in a separate session:
 
 ```python
 import pickle
@@ -53,16 +45,17 @@ with open("/tmp/preview.pkl", "rb") as f:
 evaluated = anonymizer.evaluate(loaded)
 ```
 
-Four LLM judges run: one that scores detection quality and three that score replacement quality (Substitute mode only). Note that all 4 scores are assigned per record.
+Four LLM judges run per record: one that scores detection quality and three that score replacement quality (Substitute mode only).
 
 ---
-### Entity Detection judge:
 
-### Detection validity
+### Entity Detection Judge
+
+#### Detection Validity
 
 > "Are the detected entities actually correct (value, label) pairs in context?"
 
-This judge runs during replace evaluation regardless of which replace mode was used. It looks at each detected span and flags:
+This judge runs regardless of which replace mode was used. It looks at each detected span and flags:
 
 - **false_positive** — the span is not actually identifying or sensitive in this context (common word, generic phrase, boilerplate).
 - **wrong_label** — the span is sensitive but the label sits in a clearly different domain (e.g. a company name labeled `first_name`). Sibling labels within the same broad domain are treated as valid.
@@ -77,11 +70,11 @@ This judge runs during replace evaluation regardless of which replace mode was u
 
 ---
 
-### Entity Replacement judges
+### Entity Replacement Judges
 
 When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension.
 
-### Type fidelity
+#### Type Fidelity
 
 > "Does each synthetic value still belong to the same entity class and match the expected format for that class?"
 
@@ -92,23 +85,23 @@ The judge checks that replacements are shape-compatible with their originals —
 | `type_fidelity_valid` | `bool \| None` | `True` if all replacements pass; `None` if the judge was unavailable. |
 | `type_fidelity_invalid_replacements` | `list` | Each failing replacement with label, original, synthetic, and reasoning. |
 
-### Attribute fidelity
+#### Attribute Fidelity
 
 > "Does each synthetic value preserve the salient within-entity attributes of the original?"
 
-The judge checks attributes including:
+The judge checks two attributes:
 
 - **Gender of name** — applies to `first_name`, `last_name`, `user_name`. Only checked when the original name clearly implies a gender. Adjacent or ambiguous cases pass.
 - **Age bucket** — applies to `age` and `date_of_birth`. Buckets: child (0–12), teen (13–19), young adult (20–29), adult (30–44), middle-aged (45–64), senior (65+). Adjacent buckets pass; only clear flips (adult → child) fail.
 
-All other labels are skipped — their attributes are either handled by other metrics or too unreliable to judge automatically.
+All other labels are outside the scope of this metric.
 
 | Output column | Type | Description |
 |---|---|---|
 | `attribute_fidelity_valid` | `bool \| None` | `True` if all checked attributes pass; `None` if unavailable. |
 | `attribute_fidelity_invalid_entities` | `list` | Each failing entity with attributes checked and reasoning. |
 
-### Relational consistency
+#### Relational Consistency
 
 > "Do the synthetic entities preserve the same relational coherence with each other that the originals had?"
 
@@ -152,12 +145,7 @@ Use `trace_dataframe` for the full internal trace including raw judge outputs.
 
 All four judges default to `gpt-oss-120b`. Defaults are defined in [`evaluate.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/evaluate.yaml). Override them by passing a `model_configs` YAML to `Anonymizer(model_configs=...)` — see [Models](models.md) for the full override pattern.
 
-| Role | Default | Purpose |
-|------|---------|---------|
-| `detection_validity_judge` | `gpt-oss-120b` | Checks detected (value, label) pairs for correctness. |
-| `replace_type_fidelity_judge` | `gpt-oss-120b` | Checks entity class and format preservation. |
-| `replace_attribute_fidelity_judge` | `gpt-oss-120b` | Checks within-entity attribute preservation. |
-| `replace_relational_consistency_judge` | `gpt-oss-120b` | Checks cross-entity coherence within a record. |
+The four roles are `detection_validity_judge`, `replace_type_fidelity_judge`, `replace_attribute_fidelity_judge`, and `replace_relational_consistency_judge`.
 
 ```yaml
 # my_models.yaml
@@ -169,3 +157,10 @@ selected_models:
     replace_relational_consistency_judge: your-model-alias
 ```
 
+---
+
+## Rewrite Evaluation
+
+Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review.
+
+The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details.
diff --git a/docs/concepts/replace.md b/docs/concepts/replace.md
index 1e5f87d9..19adad6b 100644
--- a/docs/concepts/replace.md
+++ b/docs/concepts/replace.md
@@ -122,4 +122,4 @@ AnonymizerConfig(replace=Hash(algorithm="sha1", digest_length=8))
 
 ## Evaluating replace output
 
-After running `replace`, you can score the quality of substitutions using LLM-as-judge evaluation. See [Evaluation](evaluation.md) for details on all four judges (detection validity, type fidelity, attribute fidelity, relational consistency) and how to call `Anonymizer.evaluate()`.
\ No newline at end of file
+After running `replace`, you can score the quality of substitutions using LLM-as-judge evaluation. See [Evaluation](evaluation.md) for details on all four judges (detection validity, type fidelity, attribute fidelity, relational consistency) and how to call `Anonymizer.evaluate()`.

From 6f7febe24c799b4784ac44bb89ffa83f1ad27a98 Mon Sep 17 00:00:00 2001
From: Marjan Emadi <memadi@nvidia.com>
Date: Tue, 9 Jun 2026 17:57:20 -0700
Subject: [PATCH 6/6] Update docs/concepts/evaluation.md: use source= in AnonymizerInput example

Co-authored-by: lipikaramaswamy <31832945+lipikaramaswamy@users.noreply.github.com>
---
 docs/concepts/evaluation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md
index 64ab8752..b45661a2 100644
--- a/docs/concepts/evaluation.md
+++ b/docs/concepts/evaluation.md
@@ -21,7 +21,7 @@ from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute
 
 anonymizer = Anonymizer()
 cfg = AnonymizerConfig(replace=Substitute())
-src = AnonymizerInput(data="data.csv", text_column="text")
+src = AnonymizerInput(source="data.csv", text_column="text")
 
 result = anonymizer.run(config=cfg, data=src)
 evaluated = anonymizer.evaluate(result)