155 changes: 150 additions & 5 deletions docs/concepts/evaluation.md
Expand Up @@ -8,7 +8,7 @@ Anonymizer provides LLM-as-judge evaluation for both modes, replace and rewrite,
| Mode | How evaluation runs |
|------|---------------------|
| **Replace** | Post-hoc, via a separate `Anonymizer.evaluate()` call after `run()` / `preview()`. |
| **Rewrite** | Runs automatically as part of every `run()` / `preview()` call. A dedicated post-hoc `evaluate()` call, matching replace mode, is planned for a future release. |
| **Rewrite** | Automatic leakage/utility scoring runs as part of every `run()` / `preview()` call. A separate `Anonymizer.evaluate()` call adds LLM-as-judge quality scoring. |

---

Expand Down Expand Up @@ -68,6 +68,15 @@ This judge runs regardless of which replace mode was used. It looks at each dete
| `detection_valid` | `bool \| None` | `True` if all detections pass; `None` if the judge was unavailable. |
| `detection_invalid_entities` | `list` | Each flagged detection with value, label, and one-sentence reasoning. |

**Special values:**

| Scenario | `detection_valid` | Display | Log |
|---|---|---|---|
| No entities detected in this record | `True` | Satisfied | `INFO`: "N passthrough row(s) have no detected entities — detection_valid set to True (trivially valid)" |
| Judge ran and all detections passed | `True` | Satisfied | — |
| Judge ran and flagged one or more detections | `False` | Not Satisfied / Partially Satisfied | — |
| Judge call failed or returned a malformed response | `None` | Unavailable | — |

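Because `None` is falsy in Python, a plain truthiness check on `detection_valid` would merge the "judge unavailable" case into the "flagged" case. The sketch below keeps the three states from the table apart. `detection_status` and its labels are hypothetical names chosen for illustration; they are not part of the Anonymizer API.

```python
from typing import Optional


def detection_status(detection_valid: Optional[bool]) -> str:
    """Map the tri-state replace-mode value to a coarse status label.

    Hypothetical helper: the label strings are illustrative only.
    """
    if detection_valid is None:
        # Judge call failed or returned a malformed response.
        return "unavailable"
    # True also covers records with no detected entities (trivially valid).
    return "passed" if detection_valid else "flagged"


# Check `is None` first; `if not detection_valid` would also match None.
statuses = [detection_status(v) for v in (True, False, None)]
```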
---

### Entity Replacement Judges
Expand Down Expand Up @@ -116,7 +125,7 @@ Relationships inspected include geographic pairings (city ↔ state, city ↔ po

---

## Reading replace evaluation results
### Reading replace evaluation results

`display_record()` renders a formatted per-record view that includes all four judge verdicts alongside the replacement map:

Expand All @@ -141,7 +150,7 @@ Use `trace_dataframe` for the full internal trace including raw judge outputs.

---

## Model roles
### Model roles

All four judges default to `gpt-oss-120b`. Defaults are defined in [`evaluate.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/evaluate.yaml). Override them by passing a `model_configs` YAML to `Anonymizer(model_configs=...)` — see [Models](models.md) for the full override pattern.

Expand All @@ -161,6 +170,142 @@ selected_models:

## Rewrite Evaluation

Rewrite evaluation is part of the pipeline and runs automatically — there is no separate call. After the rewritten text is generated, an evaluate–repair loop scores each record for **utility** (how much semantic content was preserved) and **leakage mass** (how much sensitive information survived). Records that exceed the leakage threshold are sent back for repair, up to `max_repair_iterations` times. A final judge then produces a qualitative assessment and flags records that still need human review.
Rewrite evaluation has two layers:

1. **Automatic (always runs)** — leakage mass, utility score, weighted leakage rate, and `needs_human_review` are computed as part of every `run()` / `preview()` call. See [Rewrite](rewrite.md) for the repair loop and output columns.

2. **Post-hoc LLM judges (optional)** — call `Anonymizer.evaluate()` on a completed rewrite result to add the entity detection judge and three holistic quality rubrics.

```python
from anonymizer import Anonymizer, AnonymizerConfig, AnonymizerInput, Rewrite

anonymizer = Anonymizer()
cfg = AnonymizerConfig(rewrite=Rewrite())
src = AnonymizerInput(source="data.csv", text_column="text")

result = anonymizer.run(config=cfg, data=src)
evaluated = anonymizer.evaluate(result)
evaluated.display_record(0)
```

Both `run()` and `preview()` results can be saved and evaluated in a separate session:

```python
import pickle

preview = anonymizer.preview(config=cfg, data=src, num_records=15)

with open("/tmp/preview.pkl", "wb") as f:
pickle.dump(preview, f)

# … later …
with open("/tmp/preview.pkl", "rb") as f:
loaded = pickle.load(f)

evaluated = anonymizer.evaluate(loaded)
```

---

### Entity Detection Judge

Same judge as in replace mode; see [Entity Detection Judge](#entity-detection-judge) above. In rewrite mode, `detection_valid` is returned as a **0–1 fraction** (the share of detected entities that passed) rather than a boolean. A value of `1.0` means all detections are valid; lower values mean a larger share of detections was flagged.

| Output column | Type | Description |
|---|---|---|
| `detection_valid` | `float \| None` | 1.0 if all detections pass; fraction of valid entities otherwise; `None` if the score is unavailable. |
| `detection_invalid_entities` | `list` | Each flagged detection with value, label, and one-sentence reasoning. |

**Special values:**

| Scenario | `detection_valid` | Display | Log |
|---|---|---|---|
| No entities detected in this record | `1.0` | 1.00 | `INFO`: "N passthrough row(s) have no detected entities — detection_valid set to 1.0 (trivially valid)" |
| Judge ran and all detections passed | `1.0` | 1.00 | — |
| Judge ran and flagged one or more detections | 0–1 fraction | numeric score | — |
| Judge call failed or entity data unreadable | `None` | Unavailable | `WARNING`: "Could not parse entities_by_value to compute detection_valid fraction" |

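The fraction semantics in the table can be illustrated with a small sketch. This is not the library's implementation, which is not shown here. `detection_valid_fraction` and its arguments are hypothetical, and the sketch assumes entities are identified by their string value.

```python
from typing import Optional, Sequence


def detection_valid_fraction(
    detected: Optional[Sequence[str]],
    invalid: Sequence[str],
) -> Optional[float]:
    """Return the share of detected entities that were not flagged.

    Hypothetical illustration of the documented semantics:
    - None when the entity data is unreadable,
    - 1.0 when nothing was detected (trivially valid),
    - otherwise passed / total.
    """
    if detected is None:
        return None
    if len(detected) == 0:
        return 1.0
    flagged = set(invalid)
    passed = sum(1 for value in detected if value not in flagged)
    return passed / len(detected)


# Four detections, one flagged by the judge.
example = detection_valid_fraction(["Alice", "Paris", "Acme", "1990"], ["Acme"])
```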
---

The key output columns are `utility_score`, `leakage_mass`, `weighted_leakage_rate`, `any_high_leaked`, and `needs_human_review`. See [Rewrite](rewrite.md) for more details.
### Rewrite Quality Judges

Three rubrics evaluate the holistic quality of the rewritten text. All three run as a single LLM judge call and are stored together under `judge_evaluation`.

#### Privacy

> "Does the rewritten text adequately remove linkage risk to the original record?"

Scores residual linkage risk after the rewrite — comparing rewritten values to originals, distinguishing direct identifiers from quasi-identifiers, and assessing whether remaining details narrow the candidate set of plausible matches.

| Score | Meaning |
|-------|---------|
| `high` | Original direct identifiers removed; remaining quasi-identifiers create low linkage risk. |
| `medium` | No obvious direct identifiers remain, but a distinctive quasi-identifier bundle creates noticeable linkage risk. |
| `low` | Easily or near-certainly linkable — direct identifiers remain or enough detail survives that re-identification requires minimal effort. |

#### Quality

> "How well does the rewritten text preserve important meaning, facts, and structure?"

Evaluates content preservation independent of privacy and style. Changes made for privacy reasons are not penalized when the core meaning is intact.

| Score | Meaning |
|-------|---------|
| `high` | Important meaning, facts, and structure fully preserved. |
| `medium` | Most content preserved; minor details lost or slightly distorted. |
| `low` | Material loss of important information, contradictions, or distorted core meaning. |

#### Style

> "Does the rewritten text read as fluent, coherent, and human-written prose?"

Evaluates readability, grammatical correctness, clarity, and phrasing — independent of content changes.

| Score | Meaning |
|-------|---------|
| `high` | Fluent, coherent, human-written prose. |
| `medium` | Mostly readable; isolated awkward phrasing or stiff transitions. |
| `low` | Noticeably unnatural; broken grammar, placeholder-like language, or machine-generated feel. |

Each record's `judge_evaluation` value is a dict keyed by rubric name, with a `score` and a `reasoning` entry per rubric:

```python
# Example judge_evaluation value for a single record
{
    "privacy": {"score": "high", "reasoning": "All direct identifiers removed..."},
    "quality": {"score": "medium", "reasoning": "Key facts preserved but some details lost..."},
    "style": {"score": "high", "reasoning": "Reads naturally throughout..."},
}
```
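Since the scores are ordinal labels rather than numbers, comparing them needs an explicit ranking. The following sketch picks out the rubrics at or below a chosen level for one record. It assumes the dict shape shown above; `SCORE_RANK` and `rubrics_at_or_below` are hypothetical names, not part of the Anonymizer API.

```python
# Ordinal ranking for the three documented score labels.
SCORE_RANK = {"low": 0, "medium": 1, "high": 2}


def rubrics_at_or_below(judge_evaluation: dict, threshold: str = "medium") -> list:
    """Return rubric names whose score is at or below `threshold`, sorted by name."""
    limit = SCORE_RANK[threshold]
    return sorted(
        name
        for name, verdict in judge_evaluation.items()
        if SCORE_RANK[verdict["score"]] <= limit
    )


sample = {
    "privacy": {"score": "high", "reasoning": "All direct identifiers removed..."},
    "quality": {"score": "medium", "reasoning": "Key facts preserved but some details lost..."},
    "style": {"score": "high", "reasoning": "Reads naturally throughout..."},
}
needs_attention = rubrics_at_or_below(sample)
```

For the sample record, only the `quality` rubric falls at or below `medium`.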

---

### Reading rewrite evaluation results

`display_record()` renders a formatted per-record view that includes the detection validity fraction and all three judge rubrics alongside the rewritten text:

```python
evaluated.display_record(0)
```

For a tabular overview across all records:

```python
evaluated.dataframe[["detection_valid", "judge_evaluation"]]
```
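To summarize the rubric verdicts across many records, one option is to tally scores per rubric. The sketch below works on a plain list of `judge_evaluation` values. `tally_rubric_scores` is a hypothetical helper, and skipping non-dict entries is an assumption about how a missing judge result might appear.

```python
from collections import Counter

RUBRICS = ("privacy", "quality", "style")


def tally_rubric_scores(judge_evaluations):
    """Count score labels per rubric across a list of judge_evaluation values."""
    tallies = {rubric: Counter() for rubric in RUBRICS}
    for evaluation in judge_evaluations:
        if not isinstance(evaluation, dict):
            continue  # assumption: a missing or failed judge result is not a dict
        for rubric in RUBRICS:
            verdict = evaluation.get(rubric)
            if isinstance(verdict, dict) and "score" in verdict:
                tallies[rubric][verdict["score"]] += 1
    return tallies


records = [
    {"privacy": {"score": "high"}, "quality": {"score": "medium"}, "style": {"score": "high"}},
    {"privacy": {"score": "low"}, "quality": {"score": "medium"}, "style": {"score": "high"}},
    None,  # illustrative placeholder for a record without a judge result
]
tallies = tally_rubric_scores(records)
```

If `evaluated.dataframe` is a pandas DataFrame, the input list could plausibly come from `evaluated.dataframe["judge_evaluation"].tolist()`.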

Use `trace_dataframe` for the full internal trace including raw judge outputs.

---

### Model roles

The rewrite quality judge defaults to `nemotron-30b-thinking`. The detection validity judge shares the `detection_validity_judge` role used by replace evaluation. Defaults are defined in [`evaluate.yaml`](https://github.com/NVIDIA-NeMo/Anonymizer/blob/main/src/anonymizer/config/default_model_configs/evaluate.yaml). Override them via `model_configs`:

```yaml
# my_models.yaml
selected_models:
  evaluate:
    detection_validity_judge: your-model-alias
    rewrite_judge: your-model-alias
```
30 changes: 19 additions & 11 deletions docs/concepts/rewrite.md
Expand Up @@ -11,7 +11,7 @@ Instead of replacing individual entities, rewrite mode transforms the entire tex

[Detection](detection.md) runs first (same as [Replace mode](replace.md), plus latent entity detection for context-inferable information). This includes identifying signals that may not be explicitly tagged but can be deduced from combinations of details (e.g., location inferred from contextual cues). The text is then classified by domain, and each entity or attribute is assigned a sensitivity disposition based on contextual risk, recognizing that quasi-identifiers can emerge from any aspect of the text.

The text is then rewritten to reduce identifiability, applying targeted transformations that disrupt inference (e.g., weakening or removing linking details) rather than simply rewording content. The rewritten output is evaluated for both quality and privacy leakage using adversarial testing. If thresholds are exceeded, the system automatically refines the rewrite. A final judge provides a qualitative assessment of the rewritten record. Any records that failed to meet standards are tagged for human review.
The text is then rewritten to reduce identifiability, applying targeted transformations that disrupt inference (e.g., weakening or removing linking details) rather than simply rewording content. The rewritten output is evaluated for both quality and privacy leakage using adversarial testing. If thresholds are exceeded, the system automatically refines the rewrite. Any records that failed to meet standards are tagged for human review.

---

Expand Down Expand Up @@ -128,16 +128,19 @@ config = AnonymizerConfig(

## Output columns

| Column | Description |
|--------|-------------|
| `{text_col}_rewritten` | The privacy-safe rewritten text. |
| `utility_score` | Quality preservation (0.0--1.0). Higher is better. |
| `leakage_mass` | Weighted privacy leakage. Lower is better. |
| `weighted_leakage_rate` | Normalized leakage (0.0--1.0) relative to the maximum possible leakage mass. |
| `any_high_leaked` | Whether any high-sensitivity entity leaked through. |
| `needs_human_review` | Flag for records that may need manual review. |
| Column | When available | Description |
|--------|---------------|-------------|
| `{text_col}_rewritten` | Always | The privacy-safe rewritten text. |
| `utility_score` | Always | Quality preservation (0.0--1.0). Higher is better. |
| `leakage_mass` | Always | Weighted privacy leakage. Lower is better. |
| `weighted_leakage_rate` | Always | Normalized leakage (0.0--1.0) relative to the maximum possible leakage mass. |
| `any_high_leaked` | Always | Whether any high-sensitivity entity leaked through. |
| `needs_human_review` | Always | Flag for records that may need manual review. |
| `detection_valid` | After `evaluate()` | Fraction of detected entities that passed the detection judge (0.0--1.0); `None` if judge unavailable. |
| `detection_invalid_entities` | After `evaluate()` | Flagged detections with value, label, and one-sentence reasoning. |
| `judge_evaluation` | After `evaluate()` | Dict with `privacy`, `quality`, and `style` rubric scores and reasoning. |

Use `preview.trace_dataframe` for the full pipeline trace (domain, disposition, QA pairs, repair iterations, judge evaluation).

Review comment from the PR author (Contributor):

Since `judge_evaluation` is explicitly rendered in `display_record()`, I think it belongs in the same public-facing tier as `utility_score`, `leakage_mass`, and `detection_valid`, all of which have no underscore prefix. Hence I removed the internal-use marker, i.e. the leading `_`.

Use `preview.trace_dataframe` for the full pipeline trace (domain, disposition, QA pairs, repair iterations).

!!! note "No entities? No rewrite."

Expand Down Expand Up @@ -180,4 +183,9 @@ Rewrite uses multiple LLM roles. All default to models in the [default config](m
| `rewriter` | `gpt-oss-120b` | Generates the rewritten text. |
| `evaluator` | `nemotron-30b-thinking` | Evaluates quality and leakage. |
| `repairer` | `gpt-oss-120b` | Repairs high-leakage rewrites. |
| `judge` | `nemotron-30b-thinking` | Final quality/privacy judge. |

---

## Evaluating rewrite output

After a rewrite run, call `Anonymizer.evaluate()` to score detection validity and holistic rewrite quality with LLM-as-judge evaluation. See [Evaluation](evaluation.md) for the detection judge, the three rewrite quality rubrics (privacy, quality, style), and usage details.
18 changes: 17 additions & 1 deletion docs/notebook_source/04_rewriting_biographies.py
Expand Up @@ -24,14 +24,16 @@
# 2. Classifies the domain and assigns sensitivity dispositions
# 3. Generates a rewritten version that obscures sensitive entities
# 4. Evaluates quality (utility) and privacy (leakage) with an automated repair loop
# 5. Runs a final LLM judge for informational scores
# 5. Optionally runs a final LLM judge for informational scores
#
#
# #### 📚 What you'll learn
#
# - Configure rewrite mode with `PrivacyGoal` to specify what to protect and what to preserve
# - Set evaluation criteria and risk tolerance for automated quality checks
# - Preview rewritten text and inspect utility / leakage scores
# - Triage flagged records with `needs_human_review`
# - Run `evaluate()` for detection validity and holistic judge scores (privacy, quality, style)
#
# > **Tip:** First time running notebooks? Start with
# > [setup instructions](https://nvidia-nemo.github.io/Anonymizer/latest/tutorials/).
Expand Down Expand Up @@ -153,11 +155,25 @@
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()

# %% [markdown]
# ## 🔬 Evaluate (optional)
#
# Call `evaluate()` to run LLM-as-judge scoring on the rewrite result — detection validity and three quality rubrics (privacy, quality, style).
# See [Evaluation](../../concepts/evaluation/#rewrite-evaluation) for details.

# %%
evaluated = anonymizer.evaluate(result)

# %%
evaluated.display_record(0)

# %% [markdown]
# ## ⏭️ Next steps
#
# - **[⚖️ Rewriting Legal Documents](../05_rewriting_legal_documents/)** --
# rewrite legal text with custom entity labels and domain-specific privacy goals.
# - **[📊 Evaluation](../../concepts/evaluation/#rewrite-evaluation)** --
# learn about the detection validity and rewrite quality judges in detail.
# - **[🎯 Choosing a Replacement Strategy](../03_choosing_a_replacement_strategy/)** --
# compare Redact, Annotate, Hash, and Substitute if you prefer token-level replacement.
# - **[🔍 Inspecting Detected Entities](../02_inspecting_detected_entities/)** --
Expand Down
14 changes: 14 additions & 0 deletions docs/notebook_source/05_rewriting_legal_documents.py
Expand Up @@ -179,9 +179,23 @@
print(f"{len(flagged)} of {len(df)} records flagged for human review")
flagged.head()

# %% [markdown]
# ## 🔬 Evaluate (optional)
#
# Call `evaluate()` to run LLM-as-judge scoring on the rewrite result — detection validity and three quality rubrics (privacy, quality, style).
# See [Evaluation](../../concepts/evaluation/#rewrite-evaluation) for details.

# %%
evaluated = anonymizer.evaluate(result)

# %%
evaluated.display_record(0)

# %% [markdown]
# ## ⏭️ Next steps
#
# - **[📊 Evaluation](../../concepts/evaluation/#rewrite-evaluation)** --
# learn about the detection validity and rewrite quality judges in detail.
# - **[🔍 Inspecting Detected Entities](../02_inspecting_detected_entities/)** --
# debug what the detection pipeline found before rewriting.
# - **Try it on your own data!** Swap in your CSV, define entity labels for your
Expand Down
41 changes: 38 additions & 3 deletions docs/notebooks/04_rewriting_biographies.ipynb
Expand Up @@ -14,14 +14,16 @@
"2. Classifies the domain and assigns sensitivity dispositions\n",
"3. Generates a rewritten version that obscures sensitive entities\n",
"4. Evaluates quality (utility) and privacy (leakage) with an automated repair loop\n",
"5. Runs a final LLM judge for informational scores\n",
"\n",
"After `run()`, call `Anonymizer.evaluate()` for optional LLM-as-judge scoring.\n",
"\n",
"#### 📚 What you'll learn\n",
"\n",
"- Configure rewrite mode with `PrivacyGoal` to specify what to protect and what to preserve\n",
"- Set evaluation criteria and risk tolerance for automated quality checks\n",
"- Preview rewritten text and inspect utility / leakage scores\n",
"- Triage flagged records with `needs_human_review`\n",
"- Run `evaluate()` for detection validity and holistic judge scores (privacy, quality, style)\n",
"\n",
"> **Tip:** First time running notebooks? Start with\n",
"> [setup instructions](https://nvidia-nemo.github.io/Anonymizer/latest/tutorials/)."
Expand Down Expand Up @@ -755,6 +757,37 @@
"flagged.head()"
]
},
{
"cell_type": "markdown",
"id": "e1ad0026",
"metadata": {},
"source": [
"## 🔬 Evaluate (optional)\n",
"\n",
"Call `evaluate()` to run LLM-as-judge scoring on the rewrite result — detection validity and three quality rubrics (privacy, quality, style).\n",
"See [Evaluation](../../concepts/evaluation/#rewrite-evaluation) for details."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a36d3c26",
"metadata": {},
"outputs": [],
"source": [
"evaluated = anonymizer.evaluate(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "13126cd1",
"metadata": {},
"outputs": [],
"source": [
"evaluated.display_record(0)"
]
},
{
"cell_type": "markdown",
"id": "e601cc9d",
Expand All @@ -764,6 +797,8 @@
"\n",
"- **[⚖️ Rewriting Legal Documents](../05_rewriting_legal_documents/)** --\n",
" rewrite legal text with custom entity labels and domain-specific privacy goals.\n",
"- **[📊 Evaluation](../../concepts/evaluation/#rewrite-evaluation)** --\n",
" learn about the detection validity and rewrite quality judges in detail.\n",
"- **[🎯 Choosing a Replacement Strategy](../03_choosing_a_replacement_strategy/)** --\n",
" compare Redact, Annotate, Hash, and Substitute if you prefer token-level replacement.\n",
"- **[🔍 Inspecting Detected Entities](../02_inspecting_detected_entities/)** --\n",
Expand All @@ -773,7 +808,7 @@
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"display_name": ".venv (3.11.13)",
"language": "python",
"name": "python3"
},
Expand All @@ -787,7 +822,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
"version": "3.11.13"
}
},
"nbformat": 4,
Expand Down