From d3f4f0710857bfb71f06267e8d0a1d8abfa691aa Mon Sep 17 00:00:00 2001 From: memadi Date: Mon, 8 Jun 2026 11:49:27 -0700 Subject: [PATCH 1/6] add docs for Anonymizer-replace evaluation Signed-off-by: memadi --- docs/concepts/evaluation.md | 170 ++++++++++++++++++++++++++++++++++++ docs/concepts/replace.md | 8 +- mkdocs.yml | 1 + 3 files changed, 178 insertions(+), 1 deletion(-) create mode 100644 docs/concepts/evaluation.md diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md new file mode 100644 index 00000000..8b6312c5 --- /dev/null +++ b/docs/concepts/evaluation.md @@ -0,0 +1,170 @@ + + + +# Evaluation + +Anonymizer provides LLM-as-judge evaluation for both modes, replace and rewrite, but they work differently: + +| Mode | How evaluation runs | +|------|---------------------| +| **Replace** | Post-hoc, via a separate `Anonymizer.evaluate()` call after `run()` / `preview()`. | +| **Rewrite** | Built into the anonymization pipeline — runs automatically as part of every `run()` / `preview()` call. | + +--- + +## Rewrite evaluation + +Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review. + +The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for the more details. + +--- + +## Replace evaluation + +Replace evaluation is **optional and post-hoc** — you call `Anonymizer.evaluate()` on a result from `run()` or `preview()`. The replace mode is read directly from the result object, so you don't restate it: + +```python +from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute + +anonymizer = Anonymizer() +cfg = AnonymizerConfig(replace=Substitute()) +src = AnonymizerInput(data="data.csv", text_column="text") + +result = anonymizer.run(config=cfg, data=src) +evaluated = anonymizer.evaluate(result) +evaluated.display_record(0) +``` + +You can also save a `preview()` result and evaluate it in a separate session: + +```python +import pickle + +preview = anonymizer.preview(config=cfg, data=src, num_records=15) + +with open("/tmp/preview.pkl", "wb") as f: + pickle.dump(preview, f) + +# … later … +with open("/tmp/preview.pkl", "rb") as f: + loaded = pickle.load(f) + +evaluated = anonymizer.evaluate(loaded) +``` + +All judges run per record. Records with no detected entities are skipped — judges return `valid=True` with an empty invalid list, meaning there was nothing to evaluate, not that quality was confirmed. The three replace judges additionally require a replacement map, so they only run on records processed by Substitute. + +--- + +## Detection validity + +> "Are the detected entities actually correct (value, label) pairs in context?" + +This judge runs during replace evaluation regardless of which replace mode was used. It looks at each detected span and flags: + +- **false_positive** — the span is not actually identifying or sensitive in this context (common word, generic phrase, boilerplate). +- **wrong_label** — the span is sensitive but the label sits in a clearly different domain (e.g. a company name labeled `first_name`). Sibling labels within the same broad domain are treated as valid. +- **not_in_text** — the literal value does not appear in the original text. +- **wrong_boundary** — the span is a clear partial or over-extended capture (omits part of the actual value, or absorbs surrounding function words). Descriptive words in natural prose around a bare entity value are not a boundary error. +- **contextual_mismatch** — the span refers to something other than the labeled entity type in this context (e.g. "Apple" as fruit labeled `company_name`). + +| Output column | Type | Description | +|---|---|---| +| `detection_valid` | `bool \| None` | `True` if all detections pass; `None` if the judge was unavailable. | +| `detection_invalid_entities` | `list` | Each flagged detection with value, label, and one-sentence reasoning. | + +--- + +## Replace judges + +When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension. + +### Type fidelity + +> "Does each synthetic value still belong to the same entity class and match the expected format for that class?" + +The judge checks that replacements are shape-compatible with their originals — same granularity and character class — anchored by what the original itself looks like. It does **not** check semantic attributes (gender, age bucket) or cross-entity consistency; those are separate metrics. + +| Output column | Type | Description | +|---|---|---| +| `type_fidelity_valid` | `bool \| None` | `True` if all replacements pass; `None` if the judge was unavailable. | +| `type_fidelity_invalid_replacements` | `list` | Each failing replacement with label, original, synthetic, and reasoning. | + +### Attribute fidelity + +> "Does each synthetic value preserve the salient within-entity attributes of the original?" + +The judge checks attributes including: + +- **Gender of name** — applies to `first_name`, `last_name`, `user_name`. Only checked when the original name clearly implies a gender. Adjacent or ambiguous cases pass. +- **Age bucket** — applies to `age` and `date_of_birth`. Buckets: child (0–12), teen (13–19), young adult (20–29), adult (30–44), middle-aged (45–64), senior (65+). Adjacent buckets pass; only clear flips (adult → child) fail. + +All other labels are skipped — their attributes are either handled by other metrics or too unreliable to judge automatically. + +| Output column | Type | Description | +|---|---|---| +| `attribute_fidelity_valid` | `bool \| None` | `True` if all checked attributes pass; `None` if unavailable. | +| `attribute_fidelity_invalid_entities` | `list` | Each failing entity with attributes checked and reasoning. | + +### Relational consistency + +> "Do the synthetic entities preserve the same relational coherence with each other that the originals had?" + +The judge inspects cross-entity relationships within a record — for example, whether a synthetic city is actually located in the synthetic state, or whether a synthetic date of birth is consistent with a synthetic age. Records with no checkable relationships always pass. + +Relationships inspected include geographic pairings (city ↔ state, city ↔ postcode), temporal coherence (date of birth ↔ age), and name–email alignment. + +| Output column | Type | Description | +|---|---|---| +| `relational_consistency_valid` | `bool \| None` | `True` if all relations pass; `None` if unavailable. | +| `relational_consistency_invalid_relations` | `list` | Each failing relation with participants and reasoning. | + +--- + +## Reading replace evaluation results + +`display_record()` renders a formatted per-record view that includes all four judge verdicts alongside the replacement map: + +```python +evaluated.display_record(0) +``` + +For a tabular overview across all records: + +```python +evaluated.dataframe[ + [ + "detection_valid", + "type_fidelity_valid", + "attribute_fidelity_valid", + "relational_consistency_valid", + ] +] +``` + +Use `trace_dataframe` for the full internal trace including raw judge outputs. + +--- + +## Model roles + +All four judges default to `gpt-oss-120b`. Defaults are defined in [`evaluate.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/evaluate.yaml). Override them by passing a `model_configs` YAML to `Anonymizer(model_configs=...)` — see [Models](models.md) for the full override pattern. + +| Role | Default | Purpose | +|------|---------|---------| +| `detection_validity_judge` | `gpt-oss-120b` | Checks detected (value, label) pairs for correctness. | +| `replace_type_fidelity_judge` | `gpt-oss-120b` | Checks entity class and format preservation. | +| `replace_attribute_fidelity_judge` | `gpt-oss-120b` | Checks within-entity attribute preservation. | +| `replace_relational_consistency_judge` | `gpt-oss-120b` | Checks cross-entity coherence within a record. | + +```yaml +# my_models.yaml +selected_models: + evaluate: + detection_validity_judge: your-model-alias + replace_type_fidelity_judge: your-model-alias + replace_attribute_fidelity_judge: your-model-alias + replace_relational_consistency_judge: your-model-alias +``` + diff --git a/docs/concepts/replace.md b/docs/concepts/replace.md index 69c30d8f..1e5f87d9 100644 --- a/docs/concepts/replace.md +++ b/docs/concepts/replace.md @@ -116,4 +116,10 @@ AnonymizerConfig(replace=Hash(algorithm="sha1", digest_length=8)) |-------|---------|-------------| | `algorithm` | `sha256` | Hash algorithm (`sha256`, `sha1`, or `md5`). | | `digest_length` | `12` | Number of hex characters to keep (6--64). | -| `format_template` | `` | Template with `{digest}` required; `{label}` optional. | \ No newline at end of file +| `format_template` | `` | Template with `{digest}` required; `{label}` optional. | + +--- + +## Evaluating replace output + +After running `replace`, you can score the quality of substitutions using LLM-as-judge evaluation. See [Evaluation](evaluation.md) for details on all four judges (detection validity, type fidelity, attribute fidelity, relational consistency) and how to call `Anonymizer.evaluate()`. \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index df44ad60..8b4f54ad 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -159,6 +159,7 @@ nav: - Replace: concepts/replace.md - Rewrite: concepts/rewrite.md - Choosing a Strategy: concepts/choosing-a-strategy.md + - Evaluation: concepts/evaluation.md - Self-hosting GLiNER: concepts/self-hosting-gliner.md - Troubleshooting: troubleshooting.md - Tutorials: From 50262c8899d024edd9deef43340731f20b528d1b Mon Sep 17 00:00:00 2001 From: memadi Date: Mon, 8 Jun 2026 11:55:01 -0700 Subject: [PATCH 2/6] nit Signed-off-by: memadi --- docs/concepts/evaluation.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md index 8b6312c5..19371e61 100644 --- a/docs/concepts/evaluation.md +++ b/docs/concepts/evaluation.md @@ -53,11 +53,12 @@ with open("/tmp/preview.pkl", "rb") as f: evaluated = anonymizer.evaluate(loaded) ``` -All judges run per record. Records with no detected entities are skipped — judges return `valid=True` with an empty invalid list, meaning there was nothing to evaluate, not that quality was confirmed. The three replace judges additionally require a replacement map, so they only run on records processed by Substitute. +Four LLM judges run: one that scores detection quality and three that score replacement quality (Substitute mode only). Note that all 4 scores are assigned per record. --- +### Entity Detection judge: -## Detection validity +### Detection validity > "Are the detected entities actually correct (value, label) pairs in context?" @@ -76,7 +77,7 @@ This judge runs during replace evaluation regardless of which replace mode was u --- -## Replace judges +### Entity Replacment judges When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension. From 5a4eca9b5ae3ad99d3aee8359dbf4bb8f51f62b3 Mon Sep 17 00:00:00 2001 From: Marjan Emadi Date: Mon, 8 Jun 2026 12:11:07 -0700 Subject: [PATCH 3/6] Update docs/concepts/evaluation.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> --- docs/concepts/evaluation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md index 19371e61..08c45556 100644 --- a/docs/concepts/evaluation.md +++ b/docs/concepts/evaluation.md @@ -77,7 +77,7 @@ This judge runs during replace evaluation regardless of which replace mode was u --- -### Entity Replacment judges +### Entity Replacement judges When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension. From 7ea4c4a3777c493a260b89d3d844cc25afa054a8 Mon Sep 17 00:00:00 2001 From: Marjan Emadi Date: Mon, 8 Jun 2026 12:12:16 -0700 Subject: [PATCH 4/6] Update docs/concepts/evaluation.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> --- docs/concepts/evaluation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md index 08c45556..3585cdb1 100644 --- a/docs/concepts/evaluation.md +++ b/docs/concepts/evaluation.md @@ -16,7 +16,7 @@ Anonymizer provides LLM-as-judge evaluation for both modes, replace and rewrite, Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review. -The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for the more details. +The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details. --- From dfed68551433337521efd01a66bd945f76f5d4c9 Mon Sep 17 00:00:00 2001 From: memadi Date: Mon, 8 Jun 2026 13:13:46 -0700 Subject: [PATCH 5/6] address feedback Signed-off-by: memadi --- docs/concepts/evaluation.md | 51 +++++++++++++++++-------------------- docs/concepts/replace.md | 2 +- 2 files changed, 24 insertions(+), 29 deletions(-) diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md index 3585cdb1..64ab8752 100644 --- a/docs/concepts/evaluation.md +++ b/docs/concepts/evaluation.md @@ -8,21 +8,13 @@ Anonymizer provides LLM-as-judge evaluation for both modes, replace and rewrite, | Mode | How evaluation runs | |------|---------------------| | **Replace** | Post-hoc, via a separate `Anonymizer.evaluate()` call after `run()` / `preview()`. | -| **Rewrite** | Built into the anonymization pipeline — runs automatically as part of every `run()` / `preview()` call. | +| **Rewrite** | Runs automatically as part of every `run()` / `preview()` call. A dedicated post-hoc `evaluate()` call, matching replace mode, is planned for a future release. | --- -## Rewrite evaluation +## Replace Evaluation -Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review. - -The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details. - ---- - -## Replace evaluation - -Replace evaluation is **optional and post-hoc** — you call `Anonymizer.evaluate()` on a result from `run()` or `preview()`. The replace mode is read directly from the result object, so you don't restate it: +Replace evaluation is **optional and post-hoc** — you call `Anonymizer.evaluate()` on a result from `run()` or `preview()`: ```python from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute @@ -36,7 +28,7 @@ evaluated = anonymizer.evaluate(result) evaluated.display_record(0) ``` -You can also save a `preview()` result and evaluate it in a separate session: +Both `run()` and `preview()` results can be saved and evaluated in a separate session: ```python import pickle @@ -53,16 +45,17 @@ with open("/tmp/preview.pkl", "rb") as f: evaluated = anonymizer.evaluate(loaded) ``` -Four LLM judges run: one that scores detection quality and three that score replacement quality (Substitute mode only). Note that all 4 scores are assigned per record. +Four LLM judges run per record: one that scores detection quality and three that score replacement quality (Substitute mode only). --- -### Entity Detection judge: -### Detection validity +### Entity Detection Judge + +#### Detection Validity > "Are the detected entities actually correct (value, label) pairs in context?" -This judge runs during replace evaluation regardless of which replace mode was used. It looks at each detected span and flags: +This judge runs regardless of which replace mode was used. It looks at each detected span and flags: - **false_positive** — the span is not actually identifying or sensitive in this context (common word, generic phrase, boilerplate). - **wrong_label** — the span is sensitive but the label sits in a clearly different domain (e.g. a company name labeled `first_name`). Sibling labels within the same broad domain are treated as valid. @@ -77,11 +70,11 @@ This judge runs during replace evaluation regardless of which replace mode was u --- -### Entity Replacement judges +### Entity Replacement Judges When the source result used the **Substitute** mode, three additional LLM judges run in parallel — one per quality dimension. -### Type fidelity +#### Type Fidelity > "Does each synthetic value still belong to the same entity class and match the expected format for that class?" @@ -92,23 +85,23 @@ The judge checks that replacements are shape-compatible with their originals — | `type_fidelity_valid` | `bool \| None` | `True` if all replacements pass; `None` if the judge was unavailable. | | `type_fidelity_invalid_replacements` | `list` | Each failing replacement with label, original, synthetic, and reasoning. | -### Attribute fidelity +#### Attribute Fidelity > "Does each synthetic value preserve the salient within-entity attributes of the original?" -The judge checks attributes including: +The judge checks two attributes: - **Gender of name** — applies to `first_name`, `last_name`, `user_name`. Only checked when the original name clearly implies a gender. Adjacent or ambiguous cases pass. - **Age bucket** — applies to `age` and `date_of_birth`. Buckets: child (0–12), teen (13–19), young adult (20–29), adult (30–44), middle-aged (45–64), senior (65+). Adjacent buckets pass; only clear flips (adult → child) fail. -All other labels are skipped — their attributes are either handled by other metrics or too unreliable to judge automatically. +All other labels are outside the scope of this metric. | Output column | Type | Description | |---|---|---| | `attribute_fidelity_valid` | `bool \| None` | `True` if all checked attributes pass; `None` if unavailable. | | `attribute_fidelity_invalid_entities` | `list` | Each failing entity with attributes checked and reasoning. | -### Relational consistency +#### Relational Consistency > "Do the synthetic entities preserve the same relational coherence with each other that the originals had?" @@ -152,12 +145,7 @@ Use `trace_dataframe` for the full internal trace including raw judge outputs. All four judges default to `gpt-oss-120b`. Defaults are defined in [`evaluate.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/evaluate.yaml). Override them by passing a `model_configs` YAML to `Anonymizer(model_configs=...)` — see [Models](models.md) for the full override pattern. -| Role | Default | Purpose | -|------|---------|---------| -| `detection_validity_judge` | `gpt-oss-120b` | Checks detected (value, label) pairs for correctness. | -| `replace_type_fidelity_judge` | `gpt-oss-120b` | Checks entity class and format preservation. | -| `replace_attribute_fidelity_judge` | `gpt-oss-120b` | Checks within-entity attribute preservation. | -| `replace_relational_consistency_judge` | `gpt-oss-120b` | Checks cross-entity coherence within a record. | +The four roles are `detection_validity_judge`, `replace_type_fidelity_judge`, `replace_attribute_fidelity_judge`, and `replace_relational_consistency_judge`. ```yaml # my_models.yaml @@ -169,3 +157,10 @@ selected_models: replace_relational_consistency_judge: your-model-alias ``` +--- + +## Rewrite Evaluation + +Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review. + +The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details. diff --git a/docs/concepts/replace.md b/docs/concepts/replace.md index 1e5f87d9..19adad6b 100644 --- a/docs/concepts/replace.md +++ b/docs/concepts/replace.md @@ -122,4 +122,4 @@ AnonymizerConfig(replace=Hash(algorithm="sha1", digest_length=8)) ## Evaluating replace output -After running `replace`, you can score the quality of substitutions using LLM-as-judge evaluation. See [Evaluation](evaluation.md) for details on all four judges (detection validity, type fidelity, attribute fidelity, relational consistency) and how to call `Anonymizer.evaluate()`. \ No newline at end of file +After running `replace`, you can score the quality of substitutions using LLM-as-judge evaluation. See [Evaluation](evaluation.md) for details on all four judges (detection validity, type fidelity, attribute fidelity, relational consistency) and how to call `Anonymizer.evaluate()`. From 6f7febe24c799b4784ac44bb89ffa83f1ad27a98 Mon Sep 17 00:00:00 2001 From: Marjan Emadi Date: Tue, 9 Jun 2026 17:57:20 -0700 Subject: [PATCH 6/6] Update docs/concepts/evaluation.md Co-authored-by: lipikaramaswamy <31832945+lipikaramaswamy@users.noreply.github.com> --- docs/concepts/evaluation.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/concepts/evaluation.md b/docs/concepts/evaluation.md index 64ab8752..b45661a2 100644 --- a/docs/concepts/evaluation.md +++ b/docs/concepts/evaluation.md @@ -21,7 +21,7 @@ from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Substitute anonymizer = Anonymizer() cfg = AnonymizerConfig(replace=Substitute()) -src = AnonymizerInput(data="data.csv", text_column="text") +src = AnonymizerInput(source="data.csv", text_column="text") result = anonymizer.run(config=cfg, data=src) evaluated = anonymizer.evaluate(result)